Automatic Extraction Of Systematic Polysemy Using Tree-Cut

Author

  • Noriko Tomuro
Abstract

This paper describes an automatic method for extracting systematic polysemy from a hierarchically organized semantic lexicon (WordNet). Systematic polysemy is a set of word senses that are related in systematic and predictable ways. Our method uses a modification of a tree generalization technique used in (Li and Abe, 1998), and generates a tree-cut, which is a list of clusters that partition a tree. We compare the systematic relations extracted by our automatic method to manually extracted WordNet cousins.

1 Introduction

In recent years, several on-line broad-coverage semantic lexicons have become available, including LDOCE (Procter, 1978), WordNet (Miller, 1990) and HECTOR (Kilgarriff, 1998a). These lexicons have been used as a domain-independent semantic resource as well as an evaluation criterion in various Natural Language Processing (NLP) tasks, such as Information Retrieval (IR), Information Extraction (IE) and Word Sense Disambiguation (WSD).

However, those lexicons are rather complex. For instance, WordNet (version 1.6) contains a total of over 120,000 words and 170,000 word senses, which are grouped into around 100,000 synsets (synonym sets). In addition to the size, word entries in those lexicons are often polysemous. For instance, 20% of the words in WordNet have more than one sense, and the average number of senses of those polysemous words is around 3. Also, the distinction between word senses tends to be ambiguous and arbitrary. For example, the following 6 senses are listed in WordNet for the noun "door":

1. door: a swinging or sliding barrier
2. door: the space in a wall
3. door: anything providing a means of access (or escape)
4. door: a swinging or sliding barrier that will close off access into a car
5. door: a house that is entered via a door
6. door: a room that is entered via a door

Because of the high degree of ambiguity, using such complex semantic lexicons brings some serious problems to the performance of NLP systems. The first, obvious problem is computational intractability: the increased processing time needed to disambiguate multiple possibilities will necessarily slow down the system. Another problem, which has been receiving attention in the past few years, is inaccuracy: when more than one sense is applicable in a given context, different systems (or human individuals) may select different senses as the correct sense. Indeed, recent studies in WSD show that, when sense definitions are fine-grained, similar senses become indistinguishable to human annotators and often cause disagreement on the correct tag (Ng et al., 1999; Veronis, 1998; Kilgarriff, 1998b). Also in IR and IE tasks, differences in the correct sense assignment will surely degrade the recall and precision of the systems. Thus, it is apparent that, in order for a lexicon to be useful as an evaluation criterion for NLP systems, it must represent word senses at the level of granularity that captures human intuition.

In Lexical Semantics, several approaches have been proposed which organize a lexicon based on systematic polysemy[1]: a set of word senses that are related in systematic and predictable

[1] Systematic polysemy (in the sense we use in this paper) is also referred to as regular polysemy (Apresjan, 1973) or logical polysemy (Pustejovsky, 1995).
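
The abstract describes a tree-cut as a list of clusters that partition a tree. The following minimal Python sketch illustrates only that definition. The toy hierarchy and the names `TREE`, `leaves_under` and `clusters_for_cut` are hypothetical and are not taken from WordNet or from the paper, and the sketch does not attempt the modified Li and Abe (1998) generalization technique that the paper uses to choose a cut.

```python
from typing import Dict, List

# Hypothetical toy hierarchy (NOT taken from WordNet): parent -> children.
# Nodes that never appear as a key are treated as leaves.
TREE: Dict[str, List[str]] = {
    "ENTITY": ["ARTIFACT", "LOCATION"],
    "ARTIFACT": ["barrier", "structure"],
    "LOCATION": ["space", "room"],
}


def leaves_under(node: str, tree: Dict[str, List[str]]) -> List[str]:
    """Return the leaves dominated by `node`, in left-to-right order."""
    children = tree.get(node, [])
    if not children:
        return [node]
    result: List[str] = []
    for child in children:
        result.extend(leaves_under(child, tree))
    return result


def clusters_for_cut(
    cut: List[str], tree: Dict[str, List[str]], root: str
) -> Dict[str, List[str]]:
    """Map each node in `cut` to the cluster of leaves below it.

    Raises ValueError unless the clusters partition the leaves of the
    tree, i.e. every leaf falls under exactly one cut node.
    Assumes leaf names are unique within the tree.
    """
    clusters = {node: leaves_under(node, tree) for node in cut}
    covered = [leaf for leaves in clusters.values() for leaf in leaves]
    if sorted(covered) != sorted(leaves_under(root, tree)):
        raise ValueError("the given nodes do not partition the tree")
    return clusters
```

In this toy tree, `["ARTIFACT", "LOCATION"]` and `["ARTIFACT", "space", "room"]` are both valid cuts at different levels of generalization, whereas `["ARTIFACT"]` alone is rejected because it leaves the leaves under `LOCATION` uncovered.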

Similar resources

Tree-Cut and a Lexicon Based on Systematic Polysemy

This paper describes a lexicon organized around systematic polysemy: a set of word senses that are related in systematic and predictable ways. The lexicon is derived by a fully automatic extraction method which utilizes a clustering technique called tree-cut. We compare our lexicon to WordNet cousins, and the inter-annotator disagreement observed between WordNet Semcor and DSO corpora.

Automatic Biomedical Term Polysemy Detection

Polysemy is the capacity for a word to have multiple meanings. Polysemy detection is a first step for Word Sense Induction (WSI), which makes it possible to find the different meanings of a term. Polysemy detection is also important for information extraction (IE) systems and for building or enriching terminologies and ontologies. In this paper, we present a nov...

Information Extraction Using Metadata and Solving Polysemy Problems

Data mining is the exploration and evaluation of large quantities of data to discover substantial, novel, useful and effectively understandable data. Hence, determining the knowledge of a document becomes a necessary task in data mining. In general, there are three approaches to metadata: stylistic, machine learning and knowledge bases. Sometimes the problem occurs when mining a document t...

Incremental Knowledge Acquisition from WordNet and EuroWordNet

This paper describes the process of the creation and extraction of implicit knowledge from WordNet (Fellbaum, 1998) and EuroWordNet (Vossen, 1998). This knowledge is an extension of the explicit knowledge structures already provided by the wordnets in the form of synsets and semantic relations, and is contained both within (Euro)WordNet’s hierarchical structure and the glosses that are associat...

Automatic Lane Extraction in Hemoglobin and Serum Protein Electrophoresis Using Image Processing

Image analysis is an image processing technique that aims to extract features or information from images. Image analysis has a special place in medicine because it is a basis for disease diagnosis by physicians. Electrophoresis is a laboratory separation technique. Electrophoresis images are created during the electrophoresis process. Serum protein and hemoglobin electrophoresis tests are the ...

Publication date: 2000